Red Wine Quality by Kunio Shimizu

I would like to answer the following question using a dataset of 1,599 quality-ranked red wines: which chemical properties influence the quality of red wines?

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our dataset contains 13 variables and 1,599 observations. (Variable X is just the row number, so technically there are 12 variables that explain the features of the wines.)

##                    X        fixed.acidity     volatile.acidity 
##                    0                    0                    0 
##          citric.acid       residual.sugar            chlorides 
##                    0                    0                    0 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                    0                    0                    0 
##                   pH            sulphates              alcohol 
##                    0                    0                    0 
##              quality 
##                    0

There are no missing values in this data set.
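A check like the one above can be sketched in base R as follows. This is a minimal illustration, not necessarily the code used for this report: the data frame `wine_head` is a hypothetical name and holds only the first ten rows of three columns, copied from the `str()` output above.

```r
# First ten rows of three columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this sketch.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(wine_head))
print(na_counts)
```

On the full data set the same call would be applied to the complete data frame instead of `wine_head`.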

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As can be seen in the histogram above, most wines fall into the “5” and “6” quality bins (1,319 of the 1,599 wines), and far fewer appear in the “3”, “4”, “7”, and “8” bins. The mean quality is 5.636, and the mode (the most frequent quality level, since quality is a discrete variable) is “5” according to the table.
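The summary figures quoted here can be reproduced from the frequency table alone. The sketch below rebuilds the quality vector from the counts printed above (so it does not need the data file) and recomputes the mean, median, mode, and the share of wines graded 5 or 6. Variable names are illustrative.

```r
# Rebuild the quality column from the frequency table printed above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

quality_counts <- table(quality)
quality_mean   <- mean(quality)
quality_median <- median(quality)

# Mode: the level with the largest count.
quality_mode <- as.integer(names(quality_counts)[which.max(quality_counts)])

# Share of wines graded 5 or 6.
share_5_6 <- mean(quality %in% c(5, 6))

print(quality_counts)
print(round(c(mean = quality_mean, share_5_6 = share_5_6), 3))
```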

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed.acidity looks roughly normal, with a slight positive skew, a median of 7.90, and a mean of 8.32.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile.acidity is also slightly positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric.acid shows that a large number of wines contain very little citric acid, but there also appears to be another peak just below 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Judging from the histogram above, the distribution of residual.sugar is strongly positively skewed. Most wines have a residual sugar level between 1 and 3, but some wines lie well beyond 10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution of chlorides looks similar to the residual sugar distribution above. Since there are outliers with very high levels of chlorides, the second plot zooms in on the low levels of chlorides, where the distribution looks roughly normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free.sulfur.dioxide also looks similar to the residual sugar distribution.

The above histogram was constructed by taking the log10 transformation of free.sulfur.dioxide, because the previous histogram was strongly skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Again, the distribution of total.sulfur.dioxide is positively skewed.

The above histogram also uses the log10 transformation of total.sulfur.dioxide, because the histogram on the original scale was strongly skewed to the right.
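The log10 transformation used for the two sulfur dioxide histograms can be sketched as follows. This is an illustration with base R graphics, which may differ from the plotting code used for the report. It uses only the first ten free.sulfur.dioxide values from the `str()` output above, and it draws to a null PDF device (an assumption made so the sketch runs without a display).

```r
# First ten free.sulfur.dioxide values, copied from the str() output above.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Compress the long right tail by moving to a log10 scale.
log_free_so2 <- log10(free_so2)

# Draw the histogram on a null device so no file or window is needed.
pdf(NULL)
hist(log_free_so2,
     main = "log10(free.sulfur.dioxide)",
     xlab = "log10 scale")
invisible(dev.off())
```

Because log10 is monotonic, the ordering of the wines is unchanged; only the spacing between large values shrinks, which is why the right tail looks less extreme.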

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The distribution of density looks very close to normal, with a mean of about 0.997.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The distribution of pH also looks normal, with a few outliers around pH 4.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is strongly skewed to the right; wines with sulphates levels near the maximum of 2.0 can be considered outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is skewed to the right, and the count of wines appears to decrease as the alcohol level increases.

So I apply a natural log transformation to alcohol, which makes the characteristics of its distribution easier to see in the histogram.

Univariate Analysis

What is the structure of your dataset?

The data set contains 1,599 red wines, each graded by a sensory test, and 11 chemical attributes. Most of the attributes are main components thought to determine the quality of wine.

Based on the sensory test, each red wine is graded on a quality scale from 1 (worst) to 10 (best). In this data set the observed grades range from 3 to 8.

Although most of the attributes are chemical contents used to determine the quality of wine, “pH” is the only attribute measured on an objective scale, describing how acidic or basic a wine is from 0 (very acidic) to 14 (very basic).

Other notable points are:
1. Most red wines have a quality of 5 or 6.
2. The following attributes look roughly normally distributed once outliers are set aside: acidity, sugar, chlorides, density, pH, sulphates.
3. The following attributes look skewed to the right: citric acid, sulfur dioxide, alcohol.
4. There is a considerable number of outliers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this red wine dataset is “quality”: I want to see how this entirely subjective score is determined by the components in the wine.

If we can construct a model that predicts the quality of a red wine from its components, we could use it to set appropriate prices, support quality control, develop better wines, and more.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Judging from what red wine labels in an ordinary liquor store describe (sweetness, acidity, tannin, fruit, body), attributes such as acidity, citric acid, sugar, density, and pH might determine the quality of red wine.

And of course, alcohol could be one of the important attributes for the quality of wine as well, because a wine with too much alcohol is more like a spirit and one with too little is just grape juice.

Did you create any new variables from existing variables in the dataset?

Since the “quality” of the wine is on a discrete scale, I add a new variable named “quality.factor”. It holds the same values as quality but is stored as a factor rather than an integer, which makes the bivariate plot analysis easier.

##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
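Creating such a factor column can be sketched in base R as below. The level set 3 to 8 is taken from the frequency table earlier in the report; the ten quality values come from the `str()` output, and `wine_head` is a hypothetical name for this small excerpt.

```r
# First ten quality values, copied from the str() output above.
wine_head <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Same values, stored as a factor with the six observed grades as levels.
wine_head$quality.factor <- factor(wine_head$quality, levels = 3:8)

str(wine_head$quality.factor)
```

The integer codes shown by `str()` are positions within the level set, which is why a quality of 5 appears as code 3.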

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I transformed free.sulfur.dioxide and total.sulfur.dioxide to a log10 scale, and alcohol to a natural log scale, because the distributions of these attributes looked strongly right-skewed.

Also, I dropped the variable X because it appears to be just the row index of each wine and is not relevant to my analysis.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Looking at the correlation matrix above, related attributes have relatively high correlations with each other: fixed.acidity with citric.acid (0.67) and with pH (-0.68), and the two sulfur dioxide attributes with each other (0.67).

Looking at the correlations with quality, alcohol has the highest positive correlation (0.48).
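To make the quality column of the matrix easier to read, the correlations can be ranked by absolute size. The sketch below copies the values from the matrix above into a named vector (rounded as printed) rather than recomputing them, so it runs without the data file.

```r
# Correlations with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))

# With the full data frame (assumed here to be named `wine`), the matrix
# itself would come from something like: cor(wine[sapply(wine, is.numeric)])
```

Ranked this way, alcohol, volatile.acidity, and sulphates are the three attributes most strongly correlated with quality.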

The above plots are boxplots of the acidity-related attributes by quality (as a factor).

Only volatile.acidity seems to have a clear relationship with quality: the lower the volatile.acidity, the higher the quality.

Although the boxplot suggests a positive relationship between citric.acid and quality, the point plot suggests that the many wines with citric.acid at or near 0 distort the boxplot.

In the boxplot of residual.sugar by quality, the relationship is hard to see because of the many outliers. Therefore, I omit residual.sugar outliers (upper tail only) based on the Q3 + 1.5 × IQR rule.
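The Q3 + 1.5 × IQR rule can be worked through with the quartiles printed in the summary at the top of the report. The sketch below does this for the three attributes that are trimmed in this analysis; it uses the rounded quartiles from the summary output, so the fences are approximate, and the commented filter assumes the full data frame is named `wine`.

```r
# Quartiles copied from the summary() output near the top of the report.
quartiles <- data.frame(
  attribute = c("residual.sugar", "chlorides", "sulphates"),
  q1 = c(1.9, 0.07, 0.55),
  q3 = c(2.6, 0.09, 0.73)
)

# Upper fence: anything above Q3 + 1.5 * IQR is treated as an outlier.
quartiles$iqr         <- quartiles$q3 - quartiles$q1
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr
print(quartiles)

# Applied to the full data (assumed name `wine`), a one-tail filter would be:
# wine[wine$residual.sugar <= 3.65, ]
```

The resulting fences (about 3.65, 0.12, and 1.00) are consistent with the maxima of 3.6, 0.119, and 0.99 reported for these attributes in the wine.refine summary later in the report.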

After excluding the outliers of residual.sugar, the boxplot is easier to read, but residual.sugar does not appear to have a clear relationship with the quality of red wine.

Again, the boxplot of chlorides has many outliers that make the analysis unclear, so I omit chlorides outliers using the 1.5 × IQR rule.

From the above plot, there appears to be a weak negative relationship between chlorides and quality of wine.

The above plots are boxplots of the two sulfur.dioxide attributes. It is difficult to see the relationship from these graphs, and because we know these distributions are strongly right-skewed, I take the log10 transformation of each sulfur.dioxide attribute.

Looking at the above plots, both free.sulfur.dioxide and total.sulfur.dioxide appear to have a weak negative relationship with the quality of red wine.

Boxplots of density, pH, sulphates, and alcohol

Looking at the above graphs, density seems to have a negative relationship with quality, pH shows no relationship, and alcohol has a relatively strong relationship with quality.

But again, because sulphates has so many outliers that the plot is difficult to analyze, I omit outliers with the 1.5 × IQR rule.

After omitting outliers, it seems that sulphates might have a relatively strong relationship with the quality of red wine.

Based on the correlation matrix above, I draw the following plots to see the relationships between attributes. I select attributes that have a correlation above 0.5 in absolute value with at least one other attribute.

Even though I can see the positive relationship between alcohol and quality in the plot, the points are very densely concentrated. Since I also took the log transformation of the alcohol distribution earlier, I plot log(alcohol) against red wine quality as well.

We can still see the strong positive relationship between alcohol and red wine quality.

Looking at the plots, fixed.acidity appears to have a positive relationship with citric.acid and density, and a negative relationship with pH.

Also, as expected, free.sulfur.dioxide and total.sulfur.dioxide are positively correlated with each other.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the box and point plots above, the following attributes appear to have some relationship with the quality of wine, listed by direction.

Positive relationship with red wine quality:
- citric.acid
- sulphates
- alcohol

Negative relationship with red wine quality:
- volatile.acidity
- chlorides
- total.sulfur.dioxide

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed.acidity appears to have a positive relationship with citric.acid and density, and a negative relationship with pH.

Also, as expected, free.sulfur.dioxide and total.sulfur.dioxide are positively correlated with each other.

What was the strongest relationship you found?

Judging from the plots and the correlation matrix, alcohol has the strongest positive relationship with the quality of red wine.

One thing to notice, however, is that I can only see this positive relationship between alcohol and quality where wine quality is 5 or above.

In other words, I cannot see any clear positive relationship between alcohol and quality where the quality is “3” or “4”. (But the samples for these grades are very small.)

Multivariate Plots Section

I take log(alcohol) as x and citric.acid as y, both of which are positively related to quality, and color the points by the quality of wine.

It is easy to see that wines with high alcohol and high citric.acid tend to have high quality, and wines with low alcohol and low citric.acid tend to have low quality.

But the graph also shows that even when the citric.acid level is low, good quality wines still exist if the alcohol level is high.

The above point graph shows very clearly that high levels of both alcohol and sulphates indicate a high quality red wine.

Again, citric.acid and sulphates are both positively related to quality as well.

As we can see in the plot above, chlorides and volatile.acidity, both negatively related to wine quality, are still both negatively related to quality when plotted together.

But from the graph, volatile.acidity looks more negatively correlated with the quality of red wine.

It looks like total.sulfur.dioxide might have a negative impact on the quality of red wine, but judging from the last plot, that might not be the case.

Finally, I plot the attributes with the strongest negative and the strongest positive relationship to the quality of wine:

It is clear that a high level of alcohol combined with a low level of volatile.acidity goes with high wine quality.

One thing to notice in the plot above is that quality “4” wines appear at every level of both attributes, with no apparent relationship to them, although the sample of quality 4 wines is small.

Extracting only wines of quality “4” and lower:

Now there appears to be a very weak trend: as the alcohol level gets higher, quality 4 wines appear less often. But even as volatile.acidity decreases, quality 4 wines do not become less frequent; they appear at every level of volatile acidity. This is interesting.

Finally, I group the dataset by wine quality and show the conditional median of each attribute for each quality level.

## Source: local data frame [6 x 5]
## 
##   quality.factor median.citric.acid median.sulphates median.alcohol     n
##           (fctr)              (dbl)            (dbl)          (dbl) (int)
## 1              3              0.035            0.545          9.925    10
## 2              4              0.090            0.560         10.000    53
## 3              5              0.230            0.580          9.700   681
## 4              6              0.260            0.640         10.500   638
## 5              7              0.400            0.740         11.500   199
## 6              8              0.420            0.740         12.150    18
## Source: local data frame [6 x 5]
## 
##   quality.factor median.vola.acidity median.chlorides
##           (fctr)               (dbl)            (dbl)
## 1              3               0.845           0.0905
## 2              4               0.670           0.0800
## 3              5               0.580           0.0810
## 4              6               0.490           0.0780
## 5              7               0.370           0.0730
## 6              8               0.370           0.0705
## Variables not shown: median.tot.sulfur.dio (dbl), n (int)

From the tables above, I can confirm the previous plot analysis and the direction of each attribute's effect on quality.
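Conditional medians of this kind can be sketched with base R's `aggregate()`. The printed output above suggests a data-frame package was used for the report, so this is a substitute technique rather than the original code. The sketch uses only the first ten rows from the `str()` output (hypothetical name `wine_head`), so its numbers will not match the full-data tables.

```r
# First ten rows of two columns, copied from the str() output at the top.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol within each quality grade.
medians <- aggregate(alcohol ~ quality, data = wine_head, FUN = median)

# Group sizes, matching the `n` column in the tables above.
medians$n <- as.integer(table(wine_head$quality))

print(medians)
```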

The only variable I am not clear about is total.sulfur.dioxide. It shows a negative relationship with the quality of red wine, but low levels of total sulfur dioxide also seem to be associated with low quality scores.

In the plot above, I add the fitted line of a non-linear (quadratic) lm model, y ~ poly(x, 2), to the scatter plot. The line is concave, and it may be that both high and low levels of total.sulfur.dioxide help determine whether a wine is good or bad, in combination with the other attributes.
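A quadratic fit of this form can be sketched as below. The model is still fitted with `lm()` because a polynomial is linear in its coefficients; only the relationship between x and y is curved. The sketch uses the first ten rows from the `str()` output (hypothetical name `wine_head`), so the fitted curve is for illustration only and says nothing about the full data.

```r
# First ten rows of two columns, copied from the str() output at the top.
wine_head <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Second-degree polynomial in total.sulfur.dioxide.
fit <- lm(quality ~ poly(total.sulfur.dioxide, 2), data = wine_head)

# Evaluate the fitted curve on an even grid so it can be drawn as a line.
grid <- data.frame(total.sulfur.dioxide = seq(18, 102, length.out = 50))
grid$fitted <- predict(fit, newdata = grid)

print(coef(fit))
```

The `grid` data frame holds the x and y values of the smooth line that would be overlaid on the scatter plot.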

First, I build linear models (including log transformations of some attributes) to see the relationship between each attribute and the quality of the wine.

## 
## Calls:
## model1: lm(formula = f1, data = wine)
## model2: lm(formula = f2, data = wine)
## model3: lm(formula = f3, data = wine)
## 
## ================================================================
##                                model1      model2     model3    
## ----------------------------------------------------------------
##   (Intercept)                 21.965      15.238      0.351     
##                              (21.195)    (21.438)    (0.548)    
##   fixed.acidity                0.025       0.032                
##                               (0.026)     (0.026)               
##   volatile.acidity            -1.084***   -1.122***  -1.034***  
##                               (0.121)     (0.120)    (0.101)    
##   citric.acid                 -0.183      -0.233                
##                               (0.147)     (0.146)               
##   residual.sugar               0.016       0.012                
##                               (0.015)     (0.015)               
##   chlorides                   -1.874***   -1.716***  -1.917***  
##                               (0.419)     (0.418)    (0.399)    
##   free.sulfur.dioxide          0.004*                           
##                               (0.002)                           
##   total.sulfur.dioxide        -0.003***                         
##                               (0.001)                           
##   density                    -17.881     -15.288                
##                              (21.633)    (21.601)               
##   pH                          -0.414*     -0.360     -0.455***  
##                               (0.192)     (0.190)    (0.118)    
##   sulphates                    0.916***    0.906***   0.883***  
##                               (0.114)     (0.115)    (0.111)    
##   alcohol                      0.276***                         
##                               (0.026)                           
##   log(free.sulfur.dioxide)                 0.106**    0.119**   
##                                           (0.040)    (0.039)    
##   log(total.sulfur.dioxide)               -0.154***  -0.171***  
##                                           (0.041)    (0.039)    
##   log(alcohol)                             3.004***   3.095***  
##                                           (0.283)    (0.185)    
## ----------------------------------------------------------------
##   R-squared                       0.4         0.4         0.4   
##   adj. R-squared                  0.4         0.4         0.4   
##   sigma                           0.6         0.6         0.6   
##   F                              81.3        80.5       126.1   
##   p                               0.0         0.0         0.0   
##   Log-likelihood              -1569.1     -1572.2     -1573.9   
##   Deviance                      666.4       669.0       670.4   
##   AIC                          3164.3      3170.4      3165.8   
##   BIC                          3234.2      3240.3      3214.2   
##   N                            1599        1599        1599     
## ================================================================

I created three kinds of model. Model 1 is the linear model using all attributes. Model 2 replaces the skewed attributes with their log transformations. Model 3 uses only the statistically significant attributes.
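The three model specifications can be written out as R formulas. The names `f1`, `f2`, and `f3` appear in the model output above, but their contents are not shown in the report; the definitions below are a reconstruction from the rows of the coefficient table and should be read as an assumption.

```r
# model 1: all eleven attributes on their original scale.
f1 <- quality ~ fixed.acidity + volatile.acidity + citric.acid +
  residual.sugar + chlorides + free.sulfur.dioxide +
  total.sulfur.dioxide + density + pH + sulphates + alcohol

# model 2: same attributes, with the skewed ones log-transformed.
f2 <- quality ~ fixed.acidity + volatile.acidity + citric.acid +
  residual.sugar + chlorides + log(free.sulfur.dioxide) +
  log(total.sulfur.dioxide) + density + pH + sulphates + log(alcohol)

# model 3: only the terms that have coefficients in the model3 column.
f3 <- quality ~ volatile.acidity + chlorides + pH + sulphates +
  log(free.sulfur.dioxide) + log(total.sulfur.dioxide) + log(alcohol)

# Number of predictors in each specification.
n_terms <- sapply(list(f1 = f1, f2 = f2, f3 = f3),
                  function(f) length(attr(terms(f), "term.labels")))
print(n_terms)

# Fitting needs the full data frame (named `wine` in the output above):
# model1 <- lm(f1, data = wine)
```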

Finally, I construct the same 3 models using a new data set, wine.refine, in which outliers of “residual.sugar”, “chlorides”, and “sulphates” are removed by the 1.5 × IQR rule.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.900  
##  1st Qu.: 7.100   1st Qu.:0.3900   1st Qu.:0.0850   1st Qu.:1.900  
##  Median : 7.900   Median :0.5200   Median :0.2400   Median :2.100  
##  Mean   : 8.228   Mean   :0.5258   Mean   :0.2544   Mean   :2.184  
##  3rd Qu.: 9.100   3rd Qu.:0.6350   3rd Qu.:0.4000   3rd Qu.:2.500  
##  Max.   :15.000   Max.   :1.3300   Max.   :0.7500   Max.   :3.600  
##    chlorides      free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.0120   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.0690   1st Qu.: 8.00       1st Qu.: 23.00      
##  Median :0.0780   Median :14.00       Median : 37.00      
##  Mean   :0.0779   Mean   :15.69       Mean   : 44.93      
##  3rd Qu.:0.0865   3rd Qu.:21.00       3rd Qu.: 60.00      
##  Max.   :0.1190   Max.   :57.00       Max.   :165.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.860   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9955   1st Qu.:3.220   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9966   Median :3.320   Median :0.6100   Median :10.10  
##  Mean   :0.9966   Mean   :3.324   Mean   :0.6329   Mean   :10.42  
##  3rd Qu.:0.9976   3rd Qu.:3.410   3rd Qu.:0.7050   3rd Qu.:11.00  
##  Max.   :1.0014   Max.   :4.010   Max.   :0.9900   Max.   :14.00  
##     quality      quality.factor
##  Min.   :3.000   3:  4         
##  1st Qu.:5.000   4: 39         
##  Median :6.000   5:571         
##  Mean   :5.643   6:546         
##  3rd Qu.:6.000   7:156         
##  Max.   :8.000   8: 15
## 'data.frame':    1331 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 15 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 65 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

wine.refine is the new dataset that excludes the outliers of these three attributes from the original data set. The sample size decreases to 1,331 observations.

Constructing the same models as above.

## 
## Calls:
## model4: lm(formula = f4, data = wine.refine)
## model5: lm(formula = f5, data = wine.refine)
## model6: lm(formula = f6, data = wine.refine)
## 
## ================================================================
##                                model4      model5     model6    
## ----------------------------------------------------------------
##   (Intercept)                 62.362*     57.369*     0.449     
##                              (25.985)    (26.148)    (0.596)    
##   fixed.acidity                0.053       0.056*               
##                               (0.028)     (0.028)               
##   volatile.acidity            -0.871***   -0.889***  -0.790***  
##                               (0.129)     (0.128)    (0.106)    
##   citric.acid                 -0.261      -0.285                
##                               (0.156)     (0.154)               
##   residual.sugar               0.044       0.040                
##                               (0.049)     (0.049)               
##   chlorides                   -1.323      -1.468     -2.535*    
##                               (1.301)     (1.303)    (1.212)    
##   free.sulfur.dioxide          0.004                            
##                               (0.002)                           
##   total.sulfur.dioxide        -0.003**                          
##                               (0.001)                           
##   density                    -58.835*    -57.153*               
##                              (26.488)    (26.290)               
##   pH                          -0.457*     -0.440*    -0.615***  
##                               (0.204)     (0.204)    (0.122)    
##   sulphates                    1.821***    1.828***   1.702***  
##                               (0.164)     (0.164)    (0.153)    
##   alcohol                      0.234***                         
##                               (0.032)                           
##   log(free.sulfur.dioxide)                 0.108*     0.131**   
##                                           (0.043)    (0.042)    
##   log(total.sulfur.dioxide)               -0.133**   -0.158***  
##                                           (0.043)    (0.041)    
##   log(alcohol)                             2.498***   2.996***  
##                                           (0.337)    (0.199)    
## ----------------------------------------------------------------
##   R-squared                       0.4         0.4         0.4   
##   adj. R-squared                  0.4         0.4         0.4   
##   sigma                           0.6         0.6         0.6   
##   F                              80.9        80.5       124.6   
##   p                               0.0         0.0         0.0   
##   Log-likelihood              -1214.6     -1216.0     -1220.6   
##   Deviance                      483.4       484.4       487.8   
##   AIC                          2455.1      2457.9      2459.3   
##   BIC                          2522.6      2525.5      2506.0   
##   N                            1331        1331        1331     
## ================================================================

Removing the outliers does not change anything in terms of adjusted R-squared. But the BIC is smallest for model 6, so I choose model 6 as the best predictive model.
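The comparison behind this choice can be laid out from the fit statistics in the table above. The sketch copies the AIC and BIC values of models 4 to 6 into a small data frame and picks the minimum of each criterion.

```r
# AIC and BIC copied from the model table for the wine.refine fits.
fit_stats <- data.frame(
  model = c("model4", "model5", "model6"),
  AIC   = c(2455.1, 2457.9, 2459.3),
  BIC   = c(2522.6, 2525.5, 2506.0)
)

# Lower is better for both criteria.
best_by_bic <- fit_stats$model[which.min(fit_stats$BIC)]
best_by_aic <- fit_stats$model[which.min(fit_stats$AIC)]

print(c(BIC = best_by_bic, AIC = best_by_aic))
```

The two criteria disagree here: AIC slightly favors model 4, while BIC favors model 6. BIC charges log(n) per parameter (about 7.2 for n = 1,331) instead of AIC's 2, so it rewards the smaller model more strongly.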

I also plotted the actual wine quality against the quality predicted by model 6.

The model seems to predict quality reasonably well, but there are some mispredictions.

In the second plot, I converted the predicted quality to a discrete integer factor by rounding it to 0 decimal places.
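The rounding step can be sketched as below. The vectors `actual` and `predicted` are hypothetical simulated values, not output from model 6; the point is only to show how continuous predictions become integer grades that can be cross-tabulated against the actual grades.

```r
# Hypothetical actual grades and continuous predictions (simulated).
set.seed(2)
actual    <- sample(3:8, 200, replace = TRUE,
                    prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))
predicted <- 5.6 + 0.5 * (actual - 5.6) + rnorm(200, sd = 0.4)

# Round to 0 decimal places and store as a factor with the same levels
# as the actual grades, so the table below is square.
predicted_grade <- factor(round(predicted, digits = 0), levels = 3:8)

# Rows are actual grades, columns are predicted grades.
confusion <- table(actual    = factor(actual, levels = 3:8),
                   predicted = predicted_grade)
print(confusion)

# Share of wines whose rounded prediction equals the actual grade.
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Off-diagonal cells in the table correspond to the mispredictions mentioned above.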

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It is interesting to see that, of the attributes positively related to wine quality (citric.acid, sulphates, and alcohol), alcohol seems to have the strongest positive relationship with red wine quality. Also, it looks like wines were graded relatively high when the alcohol level was high enough, even if the other positive attributes were low.

Were there any interesting or surprising interactions between features?

total.sulfur.dioxide, one of the attributes negatively related to red wine quality, seems to have a non-linear relationship with quality. In general, quality increases as total.sulfur.dioxide decreases, but at very low total.sulfur.dioxide I can see both good wines (quality above 6) and poor wines (quality 4).

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created 6 models: 3 with the full sample and 3 with the outlier-excluded sample. Each set of 3 consists of a simple linear model, a log-transformed model, and a log-transformed model restricted to the statistically significant attributes. As I suspected from the plots, the t values for volatile.acidity, chlorides, sulphates, alcohol, free.sulfur.dioxide, and total.sulfur.dioxide were significant, and their signs (positive or negative) are as expected. The final plot compared actual wine quality with predicted quality (rounded to the nearest integer).

The R-squared was quite low at about 40%, and my model only predicts qualities in the range of 5, 6, and 7, but in general the predicted and actual qualities correspond to each other.


Final Plots and Summary

Plot One

Description One

The wine quality scale runs from 1 to 10, but most of the wines fall into quality 5 or 6. The mode is 5, the median is 6, and the mean is 5.636.
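These three summaries can be computed as in the sketch below. The vector `quality_toy` is a small hypothetical sample, not the real data, and `stat_mode` is a helper introduced here for illustration, since base R's `mode()` returns the storage mode of an object, not the most frequent value.

```r
# Hypothetical quality scores (not the real data set).
quality_toy <- c(5, 5, 6, 5, 6, 7, 5, 6, 4, 6, 5, 7, 6, 5, 8, 3, 5, 6, 7)

# Illustrative helper: most frequent value, taken from a frequency table.
# Ties are resolved in favor of the smallest value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

stat_mode(quality_toy)   # most frequent grade
median(quality_toy)      # middle grade
mean(quality_toy)        # average grade
table(quality_toy)       # counts per grade, as shown in a bar chart
```

With a distribution concentrated on two adjacent grades, the mode and the median can land on different grades while the mean falls between them, which is the pattern described above.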

Plot Two

Description Two

The plot above describes the relationship between wine quality and the two attributes most strongly related to it.

In the left plot, the x-axis is the quality of the wine and the y-axis is the log of the alcohol content (%). The box plots inside the plot show the mean and percentiles of the log alcohol content. As the alcohol level increases, the quality of the wine tends to increase. The alcohol level also looks high for wines of quality 3-4, but this may be misleading because the sample size at those grades is small.

The right plot has the same x-axis, but the y-axis is volatile acidity, the amount of acetic acid in the wine, also known as “vinegar taste”. It is clear that as the level of “vinegar taste” decreases, the quality of the wine increases.

From these two plots of the attributes most strongly related to quality, it looks like wines with more alcohol and less vinegar taste tend to be of higher quality.

Plot Three

Description Three

The last plot describes the relationship between alcohol level, vinegar taste, and the quality of wine. The x-axis is the alcohol level, the y-axis is the volatile acidity (vinegar taste), and each point is a wine colored by its quality.

As seen in Plot Two, alcohol level has a positive effect on wine quality, and vinegar taste (volatile acidity) has a negative effect.

It is interesting to see that the combination of the two attributes is important in determining the quality of wine.

For example, even if the alcohol level is very high, around 2.4-2.5 on the log scale (roughly 11-12% alcohol), there still tend to be wines of quality 3 or 4 if the level of vinegar taste is high.

Conversely, even if the vinegar taste is low, wines with a low alcohol level tend to score 4 or 6.

Therefore, the combination of attributes in a wine is important in determining its quality.

Reflection

Using a data set of 1,599 wines with 11 attributes, whose quality was graded by humans from 1 to 10, I first saw that the wines fall mainly into quality grades “5” and “6”.

Then I investigated the relationship between the attributes and the quality of the red wine, and found that some seem to have positive or negative effects on quality while others do not.

The alcohol level seems to have a strong effect on the quality of red wine: the higher the alcohol level, the better the quality. I also found that volatile.acidity, also known as vinegar taste, has a relatively strong negative effect on quality.

Finally, I developed a linear regression model to predict the quality of wine. The model seems to fit, but it explains only roughly 40% of the variance. I also plotted actual and predicted wine quality together. Even though the actual quality varies from 3 to 8, the model only predicts qualities from 4 to 6.

Even though my prediction model has limited accuracy, I can use it to predict, to some extent, whether a wine's quality would be better or worse than that of others.

I was also able to plot interesting relationships between wine quality and some of the attributes, and could see clear patterns.

However, because of the very uneven distribution of quality, with most red wines graded 5 or 6, it was very difficult to find more precise patterns or build a more precise model.

The dependent variable, “quality”, is a rather vague measurement, and I found it very difficult to compare it with, or model it from, objective numeric variables such as the chemical attributes of the wines.

Although I constructed the model using linear regression, a variable like “quality”, which is more of a subjective measurement, might not be a simple linear combination of the attributes in wine. We might need to construct a more complex model, such as a non-linear or classification model.

Also, during my investigation, plotting, and modeling, I omitted some outliers using the 1.5 × IQR rule to make the plots easier to read. But if a new wine has attributes at such outlier levels, my investigation and model cannot predict its quality as well. It also seems that such wines could have scarcity value and might be either very high or very low in quality.
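For reference, the 1.5 × IQR rule can be sketched as follows. The helper name `within_fences` is hypothetical, and the example vector is just the first ten residual.sugar values printed in the structure output at the top of this report, so the result says nothing about the full data set.

```r
# Illustrative helper: TRUE for values inside the fences
# [Q1 - k * IQR, Q3 + k * IQR], using R's default quantile method.
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x >= q[1] - k * spread & x <= q[2] + k * spread
}

# First ten residual.sugar values from the str() output above.
sugar_head <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

keep <- within_fences(sugar_head)
sugar_head[!keep]   # values flagged as outliers
sugar_head[keep]    # values kept for plotting and modeling
```

Applying a filter like this to every attribute removes any wine that is extreme on at least one of them, which is why the outlier-excluded sample is smaller than the full one.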

Finally, if I were to explore this red wine quality dataset further, I would like to add 2 variables to learn more about the red wine. The first is the price of the wine: as mentioned above, quality is subjective and mostly graded 5 or 6, so replacing it with or adding the price could reveal more detail about wine quality (I would guess that the higher the quality, the more expensive the wine). The second is the age of the wine: I think the taste of a wine changes depending on how old it is, and by examining its relationship with quality and the other attributes, I think I could create a more precise model to predict the quality or the price of wine.